Samuel Pitoiset 


Supervised by Martin Peres 


GSoC student 2013 & 2014 


October 8, 2014 


Introduction 


Summary 


1. Introduction
   - What are performance counters?
   - NVIDIA's performance counters
   - Nouveau's performance counters
   - Proposal




Performance counters 


- are blocks in modern processors that monitor their activity;
- count low-level hardware events such as cache hits/misses.

Why are performance counters used?
- To analyze the bottlenecks of 3D and GPGPU applications;
- To dynamically adjust the performance level of the GPU.




NVIDIA's performance counters 


Two kinds of counters exposed by NVIDIA

- compute counters for GPGPU applications:
  - exposed through CUPTI (CUDA Profiling Tools Interface).
- graphics counters for 3D applications:
  - exposed through PerfKit, only on Windows...


Nouveau's performance counters

- compute counters support for Fermi and Kepler;
- exposed to userspace through Gallium-HUD;
- Kepler support by Christoph Bumiller (calim);
- Fermi support by myself (GSoC 2013);
- but many performance counters are left to be exposed...

Off-season work
- reverse engineered graphics counters using PerfKit on Windows 7.


Google Summer of Code 2014 


- expose NVIDIA's graphics counters for Tesla (NV50):
  - kernel interface in Nouveau DRM;
  - Mesa & GL_AMD_performance_monitor;
  - nouveau-perfkit.

Benefits to the community
- help developers find bottlenecks in their 3D applications.

Summary


2. PCOUNTER
   - The performance counters engine
   - Overview of a domain
   - Other counters?

The performance counters engine 


PCOUNTER: General overview 


- contains most of the performance counters;
- is made of several identical hardware units called domains;
- each domain has 256 input signals;
- input signals come from all over the card (global counters);
- performance counters are tied to a clock domain.
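
The idea of a domain watching one of its 256 input signals can be illustrated with a toy model. This is only a sketch: it assumes a counter simply counts the clock cycles in which its selected signal is high, and the names `ToyDomain`, `select`, `tick` and `run_demo` are made up for the example and do not reflect the real register layout.

```python
from dataclasses import dataclass

NUM_SIGNALS = 256  # each domain has 256 input signals (from the slide)


@dataclass
class ToyDomain:
    """Hypothetical model of one domain; not the real register layout."""
    selected: int = 0   # index of the input signal being watched
    count: int = 0      # cycles in which the selected signal was high
    cycles: int = 0     # total cycles elapsed in this clock domain

    def select(self, signal):
        """Choose which input signal to count and reset both counters."""
        if not 0 <= signal < NUM_SIGNALS:
            raise ValueError("signal index must be in [0, 255]")
        self.selected, self.count, self.cycles = signal, 0, 0

    def tick(self, levels):
        """Advance one clock cycle; `levels` holds the 256 input levels."""
        if len(levels) != NUM_SIGNALS:
            raise ValueError("expected 256 signal levels")
        self.cycles += 1
        if levels[self.selected]:
            self.count += 1


def run_demo():
    """Watch signal 42 for 5 cycles; it is high on cycles 0, 2 and 4."""
    domain = ToyDomain()
    domain.select(42)
    for cycle in range(5):
        levels = [False] * NUM_SIGNALS
        levels[42] = cycle % 2 == 0
        domain.tick(levels)
    return domain.count, domain.cycles
```

Keeping the cycle count next to the event count reflects the point that a counter is tied to a clock domain: the ratio of the two gives the fraction of time the signal was active.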
Figure: Example of a simple performance counter

Overview of a domain

Figure: Schematic view of a domain from PCOUNTER

Other counters?


Per-context counters (or MP-counters) 

- per-channel/process counters in PGRAPH;
- more accurate than global counters;
- same logic as PCOUNTER;
- share some in-engine multiplexers with PCOUNTER;
- currently require running an OpenCL kernel to read them.

Per-context counters (MP)


- all GPGPU signals for Tesla, Fermi and Kepler reversed;
- reverse engineered by Christoph Bumiller and myself.

Global counters (PCOUNTER)
- very chipset-dependent;
- more than 200 signals reverse engineered on NV50/Tesla;
- work done by Marcin Koscielnicki (mwk) and myself.


What about graphics counters?

- almost all 3D signals exported by PerfKit on NV50 reversed;
- some per-context counters still need to be reversed.

Summary


3. Reverse engineering
   - Windows... Kill me now!
   - How does it work?
   - OGL Performance Experiments

Reverse engineering on Windows... 


- 3D signals are exposed through PerfKit, only on Windows;
- can't use envytools (a collection of NVIDIA-related tools);
  - ... because libpciaccess doesn't work on Windows!

Bring it on!
- added libpciaccess support for Windows/Cygwin;
- envytools can now be used on Windows;
- no MMIO traces and no valgrind-mmt...;
- let's start the reverse engineering process. :)

Reverse engineering process
1. configure the hardware counters with PerfKit on Windows 7;
2. dump the configuration with some of the envytools tools:
   - but some multiplexers are very difficult to find!
3. regenerate the same result by polling the counters on Windows 7;
4. reproduce the configuration on Linux/Nouveau;
5. go to step 1...

- around 50 graphics counters exposed on the Tesla family;
- and 14 different chipsets (ouch)!
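
One way to picture the "dump the configuration" step is as a diff between two register snapshots, taken before and after PerfKit sets up a counter. The sketch below assumes the dumps have already been parsed into `{address: value}` dictionaries. The function name `diff_dumps` and the addresses in the example are purely illustrative and are not real NV50 registers or envytools output.

```python
def diff_dumps(before, after):
    """Return {address: (old, new)} for every register whose value changed.

    A register present in only one snapshot is reported with None on the
    missing side.
    """
    changed = {}
    for addr in sorted(set(before) | set(after)):
        old = before.get(addr)
        new = after.get(addr)
        if old != new:
            changed[addr] = (old, new)
    return changed


# Made-up snapshots: only 0x1004 changes, and 0x100c appears afterwards.
BEFORE = {0x1000: 0x0, 0x1004: 0x10, 0x1008: 0x7}
AFTER = {0x1000: 0x0, 0x1004: 0x2A, 0x1008: 0x7, 0x100C: 0x1}
CHANGED = diff_dumps(BEFORE, AFTER)
```

Registers that change when a single counter is enabled are candidates for its configuration. Multiplexers that live outside the dumped range would not show up in such a diff, which fits the remark that some of them are hard to find.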


OGL Performance Experiments 
- a modified version of OGLPerfHarness (PerfKit);
- to help in the reverse engineering process.

Figure: Screenshot of OGLPerfHarness (based on PerfKit) on Windows 7

Summary


4. Kernel interface
   - Introduction
   - Synchronization
   - Overview from Mesa's PoV
   - Overview from the GPU's PoV


Introduction 


Why is a kernel interface needed?

- because global counters have to be programmed via MMIO:
  - only root or the kernel can write to them.

What does the interface have to do?
- set up the configuration of counters;
- poll counters;
- expose counters' data to userspace (readout).

Synchronizing operations
- CPU: ioctls;
- GPU: software methods.

Software method

- command added to the command stream of the GPU context:
  - upon reaching the command, the GPU is paused;
  - the CPU gets an IRQ and handles the command.
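
The software-method mechanism can be sketched as a tiny simulation: commands run in order, and a software method hands control to a CPU-side handler before the stream continues. Everything here is hypothetical (the tuple-based command format, `run_command_stream`, the log strings); it only illustrates the ordering guarantee, not Nouveau's actual implementation.

```python
def run_command_stream(commands, sw_handler):
    """Execute (kind, payload) commands in order and return an event log.

    A ("sw_method", data) entry pauses the simulated GPU and calls
    `sw_handler(data)`, which stands in for the kernel's IRQ handler.
    """
    log = []
    for kind, payload in commands:
        if kind == "sw_method":
            log.append("gpu paused")
            sw_handler(payload)          # CPU handles the command
            log.append("gpu resumed")
        else:
            log.append(f"gpu executed {payload}")
    return log


HANDLED = []
LOG = run_command_stream(
    [("draw", "draw A"), ("sw_method", 7), ("draw", "draw B")],
    HANDLED.append,
)
```

Because the handler runs exactly where the command sits in the stream, any counter operation it performs is ordered with respect to the surrounding GPU work ("draw A" is finished, "draw B" has not started).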
Overview from Mesa's PoV

Figure: sequence diagram over time involving user space, the command stream, kernel space and the notifier BO, with the following steps:
(1) alloc counter object
(2) get object's handle
(3) begin monitoring
(4) end monitoring
(5) get counters' value
(6) kernel writes data
(7) mesa reads data
Overview from the GPU's PoV

Figure: diagram involving the notifier BO (ring buffer), with the following labels (step numbers only partly recoverable):
- begin monitoring
- configure counters
- end monitoring
- (5) polling counters
- (6) get counters' value
- write fence ID
- (8) copy counters' value
How to synchronize different queries?


A detailed look at the ring buffer 
- mesa sends a query ID to read out results;
- this sequence number is written at offset 0:
  - easy to check if the result is in the ring buffer.
- the ring buffer queues up 8 queries/frames (like the HUD):
  - avoids stalling the command submission.
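
The availability check described above can be sketched with a small model of the ring buffer. The layout is an assumption made for the example: word 0 holds the latest sequence number written, followed by 8 result slots indexed by `seq % 8`, with sequence numbers starting at 1. The names `NotifierRing`, `kernel_write` and `read` are hypothetical.

```python
RING_SIZE = 8  # the ring buffer queues up 8 queries (from the slide)


class NotifierRing:
    """Toy ring buffer: words[0] is the latest sequence number written."""

    def __init__(self):
        self.words = [0] + [None] * RING_SIZE   # 0 means "nothing written yet"

    def kernel_write(self, seq, values):
        """Kernel side: store the result, then publish its sequence number."""
        self.words[1 + seq % RING_SIZE] = (seq, values)
        self.words[0] = seq

    def read(self, seq):
        """Userspace side: return the values, or None if not available yet."""
        latest = self.words[0]
        if seq > latest:
            return None                          # result not written yet
        if latest - seq >= RING_SIZE:
            raise LookupError("result already overwritten")
        stored_seq, values = self.words[1 + seq % RING_SIZE]
        if stored_seq != seq:
            raise LookupError("slot holds a different query")
        return values
```

For example, after `kernel_write(1, [10])` a call to `read(1)` returns `[10]`, while `read(2)` returns `None` until query 2 has been written. With 8 slots, userspace can lag up to 7 queries behind the kernel before a result is lost, which is what lets command submission proceed without stalling.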
Figure: Schematic view of the ring buffer

Summary


5. Perfmon APIs


Performance counters APIs
- Proprietary: PerfKit, CUPTI, GL_AMD_perfmon;
- OSS: Gallium HUD only.


GL_AMD_performance_monitor

- patches available for nvc0, svga, freedreno and radeon drivers;
- my patch set (v4) is pending on mesa-dev:
  - initial work by Christoph Bumiller.

nouveau-perfkit
- a Linux/Nouveau version of NVIDIA PerfKit;
- built on top of mesa (Gallium state tracker like vdpau);
- work in progress.

General overview

Summary


6. Conclusion
   - Questions & Discussions


Conclusion 


- all 3D global counters on Tesla (NV50) reversed;
- kernel interface & mesa implementation are on the way:
  - hope to see the code in Linux 3.20.
- GL_AMD_performance_monitor's patches are pending.

- implement nouveau-perfkit as a Gallium state tracker;
- reverse engineer more performance counter signals:
  - graphics counters support for Fermi and Kepler;
- all the work which can be done around performance counters.


Questions & Discussions 


For more information, you can take a look at my blog:
http://hakzsam.wordpress.com


